action localization


VideoCapsuleNet: A Simplified Network for Action Detection

Kevin Duarte, Yogesh Rawat, Mubarak Shah

Neural Information Processing Systems

We propose a 3D capsule network for videos, called VideoCapsuleNet: a unified network for action detection which can jointly perform pixel-wise action segmentation along with action classification. The proposed network is a generalization of the capsule network from 2D to 3D, which takes a sequence of video frames as input. The 3D generalization drastically increases the number of capsules in the network, making capsule routing computationally expensive.
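The abstract's claim that the 3D generalization "drastically increases" the capsule count can be made concrete with a quick back-of-envelope calculation. The grid sizes and capsule-type count below are hypothetical illustrations, not values from the paper:

```python
# Hypothetical back-of-envelope: capsule count in a 2D vs. 3D capsule layer.
# Grid sizes and capsule-type counts below are illustrative, NOT the paper's values.

def num_capsules(spatial_dims, capsule_types):
    """Number of capsules in a convolutional capsule layer:
    one capsule of each type at every grid position."""
    n = capsule_types
    for d in spatial_dims:
        n *= d
    return n

# 2D capsule layer over a 20x20 feature map with 32 capsule types.
caps_2d = num_capsules([20, 20], 32)         # 12,800 capsules

# 3D generalization: same spatial grid, plus 8 temporal positions.
caps_3d = num_capsules([8, 20, 20], 32)      # 102,400 capsules

print(caps_2d, caps_3d, caps_3d // caps_2d)  # the frame axis multiplies routing cost
```

Since routing compares every lower-layer capsule against every higher-layer capsule, even a modest temporal extent multiplies the routing work by the number of frames, which is why the paper must address routing cost.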



Are all Frames Equal? Active Sparse Labeling for Video Action Detection

Neural Information Processing Systems

We demonstrate that the proposed approach performs better than random selection, outperforming all other baselines, with performance comparable to the supervised approach using merely 10% of the annotations.


MambaTAD: When State-Space Models Meet Long-Range Temporal Action Detection

Lu, Hui, Yu, Yi, Lu, Shijian, Rajan, Deepu, Ng, Boon Poh, Kot, Alex C., Jiang, Xudong

arXiv.org Artificial Intelligence

Abstract -- Temporal Action Detection (TAD) aims to identify and localize actions by determining their starting and ending frames within untrimmed videos. Recent Structured State-Space Models such as Mamba have demonstrated potential in TAD due to their long-range modeling capability and linear computational complexity. On the other hand, structured state-space models often face two key challenges in TAD, namely, decay of temporal context due to recursive processing and self-element conflict during global visual context modeling, which become more severe while handling long-span action instances. This paper presents MambaTAD, a new state-space TAD model that introduces long-range modeling and global feature detection capabilities for accurate temporal action detection. MambaTAD comprises two novel designs that complement each other with superior TAD performance. First, it introduces a Diagonal-Masked Bidirectional State-Space (DMBSS) module which effectively facilitates global feature fusion and temporal action detection. Second, it introduces a global feature fusion head that refines the detection progressively with multi-granularity features and global awareness. In addition, MambaTAD tackles TAD in an end-to-end one-stage manner using a new state-space temporal adapter (SSTA) which reduces network parameters and computation cost with linear complexity. Extensive experiments show that MambaTAD achieves superior TAD performance consistently across multiple public benchmarks.

Temporal action detection (TAD) aims to detect specific action categories and extract corresponding temporal spans in untrimmed videos. It is a long-standing and challenging problem in video understanding with extensive real-world applications such as sports analysis, surveillance and security. The development of deep neural networks such as CNNs [1], [2] and Transformers [3], [4] has led to continuous advancements in TAD performance over the past few years. However, CNNs have limited capabilities in capturing long-range dependencies, while Transformers face challenges with computational complexity and feature discrimination [1].

Hui Lu and Yi Yu are with the Rapid-Rich Object Search Lab, Interdisciplinary Graduate Programme, Nanyang Technological University, Singapore (e-mail: {hui007, yuyi0010}@e.ntu.edu.sg).
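The "self-element conflict" the abstract mentions refers to a position's own feature dominating its global context. A minimal way to picture the diagonal-masked bidirectional idea is an aggregation where each position fuses context from both directions while its own element is excluded. The NumPy sketch below is a plain exclusive prefix/suffix-sum illustration of that masking principle, not the paper's actual state-space recurrence or learned parameters:

```python
# Minimal sketch of diagonal-masked bidirectional context aggregation:
# position t sees everything before it (forward scan) and everything after
# it (backward scan), but its own element is masked out of the aggregate.
# Illustrative only -- NOT MambaTAD's DMBSS module or its SSM recurrence.
import numpy as np

def diagonal_masked_bidirectional(x):
    """x: (T, C) feature sequence. Returns (T, C) context features where
    row t equals sum(x[:t]) + sum(x[t+1:]), i.e. everything but itself."""
    forward = np.cumsum(x, axis=0) - x               # exclusive prefix sum
    backward = np.cumsum(x[::-1], axis=0)[::-1] - x  # exclusive suffix sum
    return forward + backward

x = np.arange(12, dtype=float).reshape(4, 3)  # 4 timesteps, 3 channels
ctx = diagonal_masked_bidirectional(x)
# Each row of ctx equals the per-channel totals minus that row's own features.
```

The point of the mask is visible in the last line: the global aggregate at each timestep never contains the timestep's own contribution, so global context cannot be trivially dominated by self-similarity.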





184260348236f9554fe9375772ff966e-Reviews.html

Neural Information Processing Systems

NIPS 2013 Neural Information Processing Systems, December 5-10, Lake Tahoe, Nevada, USA

Paper ID: 1139
Title: "Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization"

Reviews

First provide a summary of the paper, and then address the following criteria: quality, clarity, originality and significance.

This paper proposes a method for action detection (localization and classification of actions) using weakly supervised information (action labels + eye-gaze information, no explicit definition of bounding boxes). Overall, the spatio-temporal search (a huge spatio-temporal space) is carried out using dynamic programming and a max-path algorithm. Gaze information is introduced into the framework through a loss which accounts for gaze density at a given location.

QUALITY: The paper seems technically sound and makes for a nice study given gaze information.
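The max-path dynamic program the review mentions can be sketched in miniature: score every candidate location per frame, then find the temporally smooth path with the highest summed score. The score grid, the 1-D location axis, and the smoothness constraint below are simplifying assumptions for illustration, not the reviewed paper's actual formulation (which searches over spatio-temporal boxes):

```python
# Hedged sketch of a max-path dynamic program over per-frame location scores,
# in the spirit of the spatio-temporal search the review describes. A real
# system would score candidate boxes per frame; here locations are a 1-D
# index and the smoothness constraint (max_jump) is an assumption.
import numpy as np

def max_path(scores, max_jump=1):
    """scores: (T, L) array of per-frame scores for L candidate locations.
    Returns the location sequence maximizing the summed score, where
    consecutive locations may differ by at most max_jump."""
    T, L = scores.shape
    dp = scores[0].copy()                 # best score ending at each location
    back = np.zeros((T, L), dtype=int)    # backpointers for path recovery
    for t in range(1, T):
        new_dp = np.full(L, -np.inf)
        for l in range(L):
            lo, hi = max(0, l - max_jump), min(L, l + max_jump + 1)
            prev = int(np.argmax(dp[lo:hi])) + lo
            new_dp[l] = dp[prev] + scores[t, l]
            back[t, l] = prev
        dp = new_dp
    # Trace back from the best final location.
    path = [int(np.argmax(dp))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Because each frame only consults its predecessor, the search over all location sequences runs in O(T * L * max_jump) instead of the exponential cost of enumerating paths, which is what makes the huge spatio-temporal space tractable.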